The table above shows no truly missing data, but it does contain some problematic values, such as negative solar radiation. We assume that each negative radiation value is due to a data transmission error, so we set all negative values to zero. We also parse the date and hour strings into DateTime and Date columns:
DT[DT.Radiation .< 0, :Radiation] .= 0.0;
DT.day = parse.(Int64, chop.(DT[:, "Date-Hour(NMT)"], head = 0, tail = 14))
DT.month = parse.(Int64, chop.(DT[:, "Date-Hour(NMT)"], head = 3, tail = 11))
DT.hour = parse.(Int64, chop.(DT[:, "Date-Hour(NMT)"], head = 11, tail = 3));
rename!(DT, "Date-Hour(NMT)" => "timedate");
# Create column with right formatting
DT.timedate_real = DateTime.(2017, DT.month, DT.day, DT.hour);
DT.date_real = Date.(2017, DT.month, DT.day);
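As an alternative to slicing the string three times, the timestamp could probably be parsed in one step with a DateFormat. The sketch below is only an illustration: the format string "dd.mm.yyyy-HH:MM" is an assumption inferred from the chop offsets (the real separators in the column may differ), and the sample values are made up rather than taken from the dataset.

```julia
using Dates

# Assumed timestamp layout; adjust the separators to match the real column.
assumed_format = dateformat"dd.mm.yyyy-HH:MM"

# Hypothetical sample values, not taken from the dataset.
raw_stamps = ["01.01.2017-00:00", "15.06.2017-13:00"]
raw_radiation = [-1.2, 350.0]

# Parse each string directly into a DateTime.
stamps = [DateTime(s, assumed_format) for s in raw_stamps]

# Replace negative radiation readings with zero.
radiation = max.(raw_radiation, 0.0)

println(stamps)
println(radiation)
```

Parsing the whole string also recovers the year from the data instead of hard-coding 2017.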
Let’s visualize the SystemProduction for each hour:
The plot shows several runs of consecutive days with zero production (for example in January, May and December). It seems that zeros were recorded instead of missing values when no data was available. We need to be very careful when cleaning the data, because at night the actual production of the solar panels really is zero. Let’s group by day and check the days with zero production:
df = groupby(DT, :date_real)
dt = combine(df,
    ["SystemProduction", "WindSpeed", "Sunshine", "AirPressure", "Radiation", "AirTemperature", "RelativeAirHumidity", "month"] .=>
    [sum, mean, mean, mean, mean, mean, mean, mean];
    renamecols = true);
sort!(dt, :date_real);
p1 = scatter(dt.Radiation_mean, dt.SystemProduction_sum, title = "Production vs Radiation (day)", label = :none)
p1 = vline!([40], label = :none)
p1 = hline!([1800], label = :none)
p2 = scatter(DT.Radiation, DT.SystemProduction, title = "Production vs Radiation (hour)", label = :none)
plot(p1, p2, layout = (1, 2), size = (750, 300))
From the plot it is clear that there are some outliers where the mean radiation is greater than 40 W/m² and the production is lower than 1800 kWh per day. Let’s mark those days as suspicious, check whether any individual hours also look suspicious (zero production while radiation is above 40 W/m²), and then compute the correlation between the variables in the cleaned dataset:
suspect_day = dt[(dt.Radiation_mean .> 40) .&& (dt.SystemProduction_sum .< 1800), :date_real]
# filter! evaluates the predicate row by row, so x is a single date
filter!([:date_real, :SystemProduction, :Radiation] => (x, y, z) -> x ∉ suspect_day && !(y == 0 && z > 40), DT)
cor(Matrix(DT[:, 2:8]))
The last step before building the machine learning model is to turn time into numerical variables using a cyclical encoder. This method lets us account for the cyclical nature of months, days and hours:
\[s(t) = \sum_{n=1}^{N} \left( a_n \cos\left(\frac{2\pi n t}{P}\right) + b_n \sin\left(\frac{2\pi n t}{P}\right) \right)\]
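A common simplification of the series above keeps only the first harmonic (n = 1), which gives one sine/cosine pair per time feature. The sketch below assumes that simplification; the helper names cyclical_encode and dist are hypothetical, and the period of 24 is simply the number of hours in a day.

```julia
# Hypothetical helper: first-harmonic (n = 1) sine/cosine encoding of a
# time value t with period P.
cyclical_encode(t, P) = (sin(2π * t / P), cos(2π * t / P))

hours = 0:23
encoded = [cyclical_encode(h, 24) for h in hours]

# Euclidean distance between two encoded points on the unit circle.
dist(a, b) = hypot(a[1] - b[1], a[2] - b[2])

d_23_0 = dist(encoded[24], encoded[1])   # hour 23 vs hour 0
d_0_1  = dist(encoded[1], encoded[2])    # hour 0 vs hour 1
d_0_12 = dist(encoded[1], encoded[13])   # hour 0 vs hour 12

println((d_23_0, d_0_1, d_0_12))
```

After encoding, hour 23 is as close to hour 0 as hour 1 is, while hour 12 sits on the opposite side of the circle. A raw integer hour column would not preserve this. The same function can be applied to months with P = 12.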
EvoTrees is a Julia library for building gradient boosting regression models. It combines many weak decision trees into a stronger final model, with a focus on training performance. It supports various loss functions designed for regression tasks, which guide the training process and measure how well the model performs. The library uses histogram-based algorithms for faster data processing and can handle different types of features, including categorical ones.
We split the dataset into train and test sets, considering only the hours with solar radiation greater than zero (which excludes nights and evenings). The train set runs from 01/01 to 06/30 and the test set from 07/01 to 12/31. We perform the estimation hourly and then aggregate by day of the year. We include all the available variables in the model.
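The split rule can be sketched on a few made-up hourly records. This is only an illustration in plain Julia: the records, their values and the names daylight, train_set and test_set are hypothetical, and the real data lives in the DT DataFrame.

```julia
using Dates

# Hypothetical hourly records; values are invented for illustration.
records = [
    (timedate = DateTime(2017, 3, 10, 2),  Radiation = 0.0,   SystemProduction = 0.0),
    (timedate = DateTime(2017, 3, 10, 12), Radiation = 420.0, SystemProduction = 910.0),
    (timedate = DateTime(2017, 6, 30, 18), Radiation = 35.0,  SystemProduction = 60.0),
    (timedate = DateTime(2017, 7, 1, 9),   Radiation = 210.0, SystemProduction = 480.0),
    (timedate = DateTime(2017, 11, 5, 23), Radiation = 0.0,   SystemProduction = 0.0),
]

# Keep only hours with positive radiation.
daylight = filter(r -> r.Radiation > 0, records)

# First half of the year for training, second half for testing.
split_date = Date(2017, 7, 1)
train_set = filter(r -> Date(r.timedate) < split_date, daylight)
test_set  = filter(r -> Date(r.timedate) >= split_date, daylight)

println(length(train_set), " training rows, ", length(test_set), " test rows")
```

Splitting by date rather than at random keeps the test period strictly after the training period, at the cost of training on winter and spring and testing on summer and autumn.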
We need to load the EvoTreeRegressor algorithm, set its parameters, create the machine and cross-validate the model using 5 folds, repeating the operation 5 times.
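To make "5 folds repeated 5 times" concrete, here is a minimal sketch of how such train/validation index pairs could be generated by hand. It is not the library's implementation: the helper repeated_kfold is hypothetical, and the row count of 100 and the random seed are arbitrary assumptions.

```julia
using Random

# Hypothetical helper: (train, validation) index pairs for k-fold
# cross-validation, repeated `repeats` times with a fresh shuffle each time.
function repeated_kfold(n, k, repeats; rng = MersenneTwister(42))
    splits = Tuple{Vector{Int}, Vector{Int}}[]
    for _ in 1:repeats
        idx = shuffle(rng, 1:n)
        for fold in 1:k
            # Every k-th shuffled index, starting at `fold`, is held out.
            val = idx[fold:k:end]
            train = setdiff(idx, val)
            push!(splits, (train, val))
        end
    end
    return splits
end

splits = repeated_kfold(100, 5, 5)
println(length(splits))   # 25 train/validation pairs
```

Each of the 25 pairs gives one validation score. Averaging over repeated shuffles reduces the dependence of the estimate on any single partition of the rows.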